Question 1

a.)

library(ggplot2)
library(plotly)
muscle <- read.table("muscle.txt")
colnames(muscle) <- c("mass", "age")
head(muscle, 10)
##    mass age
## 1   106  43
## 2   106  41
## 3    97  47
## 4   113  46
## 5    96  45
## 6   119  41
## 7    92  47
## 8   112  41
## 9    92  48
## 10  102  48
ggplotly(ggplot(muscle, aes(x = mass)) + geom_histogram(bins = 15, fill = "#FF6347"))
ggplotly(ggplot(muscle, aes(x = age)) + geom_histogram(bins = 15, fill = "#FFA07A"))

We can see from the plots that mass is slightly right-skewed, while the age histogram shows a stronger right skew.
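The visual impression can be checked numerically. The sketch below computes sample skewness as the third standardized moment (a hand-rolled helper, to avoid extra packages); positive values indicate right skew.

```r
# Sample skewness via the third standardized moment (a sketch)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(mass = skew(muscle$mass), age = skew(muscle$age))
```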

ggplotly(ggplot(muscle, aes(x = age, y = mass)) + geom_point(color = "#FFC300"))

The scatter plot suggests a negative linear relationship between age and muscle mass.

b.)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
## 
##     muscle
## The following object is masked from 'package:plotly':
## 
##     select
bxcx <- boxcox(mass ~ age, data = muscle)

lambda <- bxcx$x[which.max(bxcx$y)]
lambda
## [1] 1.151515

Since the 95% Box-Cox confidence interval for \(\lambda\) contains 1, we conclude that a transformation of the response is unnecessary.
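The interval can be read off the profile log-likelihood stored in the `boxcox` object: the 95% confidence region consists of all \(\lambda\) whose log-likelihood is within \(\chi^2_{0.95, 1}/2\) of the maximum. A sketch:

```r
# Approximate 95% CI for lambda from the profile likelihood (a sketch)
ci <- range(bxcx$x[bxcx$y > max(bxcx$y) - qchisq(0.95, 1) / 2])
ci  # the interval should contain 1
```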

c.)

#Build the model
modelMM <- lm(mass ~ age, data = muscle)
sumModelMM <- summary(modelMM)

# Coefficients and their standard errors
sumModelMM$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 156.346564 5.51226249  28.36341 1.085415e-35
## age          -1.189996 0.09019725 -13.19326 4.123987e-19
# MSE
MSE <- sum(sumModelMM$residuals^2) / (nrow(muscle) - 2)
MSE
## [1] 66.80082
# Degrees of Freedom for MSE
sumModelMM$df[2]
## [1] 58
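As a cross-check, `sigma()` on the fitted model returns the residual standard error (so its square is the MSE), and `df.residual()` returns the error degrees of freedom directly:

```r
# Cross-check the manual MSE and df computations
sigma(modelMM)^2      # should match the MSE above
df.residual(modelMM)  # should match 58
```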

d.) The estimated regression line is: \[ \hat{Y}_i = 156.347 - 1.190X_i \]

ggplotly(
  ggplot(muscle, aes(x = age, y = mass)) + geom_point(color = "#FFC300") + geom_abline(
    intercept = 156.346564,
    slope = -1.189996,
    col = "#003cff"
  )
)

We can see that the fitted regression line tracks the data closely.

e.)

# Fitted values for 6th and 16th value 
modelMM$fitted.values[c(6,16)]
##         6        16 
## 107.55675  90.89681
#Residuals for 6th and 16th cases
modelMM$residuals[c(6,16)]
##         6        16 
## 11.443252 -3.896811
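As a sanity check, each observed value should equal its fitted value plus its residual:

```r
# Observed = fitted + residual for cases 6 and 16
muscle$mass[c(6, 16)]
modelMM$fitted.values[c(6, 16)] + modelMM$residuals[c(6, 16)]
```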

f.) The Normal Error Model is written as the following: \[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \] with \(i = 1, \dots, n\) where the error terms are independently and identically distributed (i.i.d.) \(N(0, \sigma^2)\) random variables.

#Errors vs Fitted Values 
ggplot() + aes(x = modelMM$fitted.values, y = modelMM$residuals) + geom_point() + labs(x = "Fitted Values", y = "Residuals") + geom_hline(yintercept = 0)


## Normal Q-Q plot of the residuals
ggplot(data.frame(res = modelMM$residuals), aes(sample = res)) + stat_qq() + stat_qq_line() + labs(x = "Theoretical Quantiles", y = "Sample Quantiles", title = "Normal Q-Q")


Both residual plots suggest that the normality assumption is met. The Residuals vs. Fitted Values plot shows no pattern, with random scatter throughout. The Normal Q-Q plot follows the reference line closely, with no strong deviations that would indicate skewness or heavy tails.
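A formal test can complement the graphical checks. The Shapiro-Wilk test applied to the residuals gives a numeric assessment of normality; a large p-value is consistent with the normality assumption.

```r
# Shapiro-Wilk normality test on the residuals (a complement to the Q-Q plot)
shapiro.test(modelMM$residuals)
```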

g.) We construct a 99% confidence interval for \(\beta_0\), which can equivalently be used to test \[ H_0: \beta_0 = 0 \\ H_a: \beta_0 \neq 0 \]

as.numeric(
  c(
    modelMM$coefficients[1] - qt(1 - 0.01 / 2, nrow(muscle) - 2) * sumModelMM$coefficients[1, 2],
    modelMM$coefficients[1] + qt(1 - 0.01 / 2, nrow(muscle) - 2) * sumModelMM$coefficients[1, 2]
  )
)
## [1] 141.6658 171.0273

We are 99% confident that the true value for \(\beta_0\) is contained within the interval (141.6658, 171.0273).
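The manual interval can be verified against R's built-in interval function:

```r
# Built-in confidence intervals for the regression coefficients
confint(modelMM, level = 0.99)  # first row should match the interval above
```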

h.) We test the hypothesis: \[ H_0: \beta_1 \geq 0 \\ H_a: \beta_1 < 0 \] with test statistic \[ T^* = \frac{\hat{\beta}_1 - \beta_1^{(0)}}{s\{\hat{\beta}_1\}}, \qquad T^* \sim t_{(n-2)} \text{ under } H_0 \] where \(\beta_1^{(0)} = 0\). We reject \(H_0\) if and only if \(T^* < t(\alpha; n-2)\).

Tstar <- sumModelMM$coefficients[2,1] / sumModelMM$coefficients[2,2]
Tstar
## [1] -13.19326
Tstar < qt(0.01, nrow(muscle) - 2)
## [1] TRUE

Since our test statistic is less than the critical value, we reject the null hypothesis and conclude that there is a negative linear association between muscle mass and age for women.
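Equivalently, the decision can be expressed through the one-sided p-value of the lower-tailed test:

```r
# One-sided p-value: P(t_{n-2} < Tstar); reject H0 if below alpha = 0.01
pt(Tstar, df = nrow(muscle) - 2)
```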

i.)

predict(
  modelMM,
  newdata = data.frame(
   age = 60
  ),
  interval = "prediction",
  level = 0.95
)
##        fit      lwr     upr
## 1 84.94683 68.45067 101.443

With 95% confidence, we predict that the muscle mass of a 60-year-old woman will fall within the interval (68.451, 101.443).
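The same interval can be built by hand from the prediction-error standard deviation, \(s\{\text{pred}\} = \sqrt{MSE\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right)}\). A sketch, reusing the MSE from part c:

```r
# Manual 95% prediction interval at age = 60 (a sketch)
n <- nrow(muscle)
xbar <- mean(muscle$age)
yhat <- sum(coef(modelMM) * c(1, 60))  # point prediction b0 + b1 * 60
s_pred <- sqrt(MSE * (1 + 1 / n + (60 - xbar)^2 / sum((muscle$age - xbar)^2)))
yhat + c(-1, 1) * qt(0.975, n - 2) * s_pred  # should match predict()
```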

j.) We test the hypothesis: \[ H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \] With test statistic: \[ F^* = \frac{MSR}{MSE} \\ F^* \sim F_{(1, n-2)} \] We reject \(H_0\) if \(F^* > F(1 - \alpha, 1, n-2)\).

anova(modelMM)
## Analysis of Variance Table
## 
## Response: mass
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## age        1 11627.5 11627.5  174.06 < 2.2e-16 ***
## Residuals 58  3874.4    66.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At \(\alpha = 0.01\), with a p-value less than \(2.2 \times 10^{-16}\), we reject \(H_0\) and conclude that there is a linear association between the amount of muscle mass and age.
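In simple linear regression the F test and the two-sided t test on \(\beta_1\) are equivalent: \(F^* = (T^*)^2\). A quick check:

```r
# F statistic equals the squared t statistic for the slope
sumModelMM$coefficients[2, 3]^2  # should match the F value of 174.06
```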

k.)

#Proportion of total variation is measured by R^2
sumModelMM$r.squared
## [1] 0.7500668
cor(muscle$age, muscle$mass)
## [1] -0.866064

About 75.01% of the variation in muscle mass is explained by the regression on age. We can also see that muscle mass and age have a correlation of -0.8661, indicating a strong, negative linear relationship.
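These two quantities are linked: in simple linear regression, \(R^2\) equals the squared sample correlation between the two variables.

```r
# R^2 equals the squared correlation in simple linear regression
cor(muscle$age, muscle$mass)^2  # should match r.squared above
```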

Question 2

For the plot in the top left, we can see that the distribution is right skewed, given that the points are condensing in lower values of the sample quantiles. For the top right plot, we can see this data has lighter tails and more condensed data in the center, making a more narrow distribution. For the bottom left, this seems to be a bimodal plot, given the two clusters at the bottom left and top right of the qqplot. For the final plot, the bottom right, this is data that looks to be approximately normal, with no obvious signs of skewness.